If an application needs to do more than export OCR results to a searchable text or PDF document, the Internal Structured Data generated by a call to recognizeToMemory() can be accessed through iterators obtained from a set of “Get” functions. Together, the Get functions and the iterators are referred to as the Results Manager. Everything from a Document to a Character is a subclass of the Result object. The term result element refers to any document, page, region, text block, text line, word, or character.
To use the Results Manager, it helps to understand the relationship between the layers of the Internal Structured Data (see figures 1 and 2 in OCR Xpress for Java Functionality).
- The topmost layer is called the document. A document is a set of one or more pages.
- Each page holds the results captured from a single image and is broken down into one or more regions.
- The regions may be segmented along a number of parameters.
For example, the height of the characters in one region may differ from the height of the characters in another region. In this case, the two regions were segmented by the character height parameter.
In most cases, there will only be one region per page.
- Each region is broken down into text blocks (usually paragraphs).
Note that there are no hard rules for segmenting Regions and Text Blocks; many elements of an image can affect how these two layers are segmented. Their primary use is to group the more important sub-layers: Text Lines, Words, and Characters.
- Each Text Block will contain one or more Text Lines.
- Each Text Line will contain one or more Words.
- Each Word will contain one or more Characters.
The Results Manager allows applications to access the layers of the Internal Structured Data in a coherent manner. In short, applications can interrogate every item in every sub-layer of a page, every item in every sub-layer of a text block, and so on. Every result element can be interrogated for its content, its position in the image, or even the confidence that the OCR result is correct.
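The idea that every result element answers the same three questions (content, position, confidence) can be illustrated with a small, self-contained sketch. It is not the OCR Xpress API: the names Result, Box, Char, Word, and sampleWord are hypothetical, the 0.0 to 1.0 confidence scale is an assumption, and taking the lowest character confidence as the word confidence is only one possible design choice. The sketch uses Java records (Java 16 or later).

```java
import java.util.List;

public class ResultModelSketch {

    /** Hypothetical bounding box in image pixel coordinates. */
    record Box(int x, int y, int width, int height) {}

    /** Hypothetical common base type: every element exposes the same three queries. */
    interface Result {
        String text();        // recognized content
        Box bounds();         // position in the image
        double confidence();  // assumed scale: 0.0 (no confidence) to 1.0 (certain)
    }

    /** Lowest layer: a single recognized character. */
    record Char(char value, Box bounds, double confidence) implements Result {
        @Override
        public String text() {
            return String.valueOf(value);
        }
    }

    /** A word derives its answers from the one or more characters it contains. */
    record Word(List<Char> characters) implements Result {
        @Override
        public String text() {
            StringBuilder sb = new StringBuilder();
            for (Char c : characters) {
                sb.append(c.value());
            }
            return sb.toString();
        }

        @Override
        public Box bounds() {
            // Smallest box that encloses every character box.
            int left = Integer.MAX_VALUE, top = Integer.MAX_VALUE;
            int right = Integer.MIN_VALUE, bottom = Integer.MIN_VALUE;
            for (Char c : characters) {
                Box b = c.bounds();
                left = Math.min(left, b.x());
                top = Math.min(top, b.y());
                right = Math.max(right, b.x() + b.width());
                bottom = Math.max(bottom, b.y() + b.height());
            }
            return new Box(left, top, right - left, bottom - top);
        }

        @Override
        public double confidence() {
            // Assumption for this sketch: a word is only as reliable as its weakest character.
            return characters.stream().mapToDouble(Char::confidence).min().orElse(0.0);
        }
    }

    /** Builds a two-character sample word with made-up positions and confidences. */
    static Word sampleWord() {
        return new Word(List.of(
                new Char('H', new Box(10, 20, 8, 12), 0.98),
                new Char('i', new Box(19, 20, 4, 12), 0.80)));
    }

    public static void main(String[] args) {
        Result word = sampleWord();
        System.out.println(word.text() + " at " + word.bounds()
                + ", confidence " + word.confidence());
    }
}
```

Because Word and Char share the Result type, calling code can treat any layer uniformly. The same pattern would extend upward to text lines, text blocks, regions, pages, and documents.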
A particular result (e.g., a word) can be accessed in one of the following ways:
- Hierarchically, by traversing through the intermediate elements.
For example, to access a particular word, you would iteratively traverse the following route: Document->Pages->Regions->TextBlocks->TextLines->Words->Word
- Directly, by getting the words from any parent result.
For example, you can get a word directly from a page without needing to traverse the region, text block, and text line in between.
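The two access styles can be contrasted with a minimal sketch. The types below are hypothetical stand-ins for the layers, not the OCR Xpress classes or its actual Get functions; allWords, wordsHierarchically, wordsDirectly, and sampleDocument are illustrative names only. The sketch uses Java records (Java 16 or later).

```java
import java.util.ArrayList;
import java.util.List;

public class ResultAccessSketch {

    // Hypothetical stand-ins for the layers; not the OCR Xpress classes.
    record Word(String text) {}
    record TextLine(List<Word> words) {}
    record TextBlock(List<TextLine> lines) {}
    record Region(List<TextBlock> blocks) {}
    record Document(List<Page> pages) {}

    record Page(List<Region> regions) {
        /** Direct access: the page hides the intermediate layers from the caller. */
        List<Word> allWords() {
            List<Word> out = new ArrayList<>();
            for (Region region : regions) {
                for (TextBlock block : region.blocks()) {
                    for (TextLine line : block.lines()) {
                        out.addAll(line.words());
                    }
                }
            }
            return out;
        }
    }

    /** Hierarchical access: the caller walks every intermediate layer itself. */
    static List<String> wordsHierarchically(Document doc) {
        List<String> out = new ArrayList<>();
        for (Page page : doc.pages()) {
            for (Region region : page.regions()) {
                for (TextBlock block : region.blocks()) {
                    for (TextLine line : block.lines()) {
                        for (Word word : line.words()) {
                            out.add(word.text());
                        }
                    }
                }
            }
        }
        return out;
    }

    /** Direct access: the caller asks each page for its words in one step. */
    static List<String> wordsDirectly(Document doc) {
        List<String> out = new ArrayList<>();
        for (Page page : doc.pages()) {
            for (Word word : page.allWords()) {
                out.add(word.text());
            }
        }
        return out;
    }

    /** One page, one region, two text blocks, three text lines, four words. */
    static Document sampleDocument() {
        TextLine line1 = new TextLine(List.of(new Word("Hello"), new Word("world")));
        TextLine line2 = new TextLine(List.of(new Word("again")));
        TextLine line3 = new TextLine(List.of(new Word("Bye")));
        Region region = new Region(List.of(
                new TextBlock(List.of(line1, line2)),
                new TextBlock(List.of(line3))));
        return new Document(List.of(new Page(List.of(region))));
    }

    public static void main(String[] args) {
        Document doc = sampleDocument();
        System.out.println(wordsHierarchically(doc));
        System.out.println(wordsDirectly(doc));
    }
}
```

Both methods return the same words in reading order. The hierarchical form is useful when the intermediate elements matter (for example, to know which text line a word belongs to), while the direct form keeps calling code short when only the words are needed.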